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1.  Introduction 


1.1  Objective 

This  report  focuses  on  the  design  of  a  new  family  of  local  anomaly  detectors  for  hyperspectral 
sensor  imagery  (HSI).  These  detectors  employ  an  indirect  sample  comparison  via  the 
mathematics  of  semiparametric  statistics  and  large  sample  theory  to  test  the  likelihood  that  local 
HSI  random  samples  belong  to  the  same  population,  or  class.  The  employed  notion  is  not  based 
on  a  physical  motivation  but  on  a  statistical  one,  and  it  is  proposed  as  a  viable  alternative  to 
testing  a  two-sample  hypothesis  using  conventional  methods.  The  aim  is  to  achieve  a  significant 
reduction  of  meaningless  detections  via  unsupervised  learning  methods,  while  maintaining  a 
high  probability  of  meaningful  detections.  The  notion  I  seek  to  implement  is  relatively  simple, 
but  some  of  the  mathematics  presented  in  this  report  are  nontrivial,  as  Normality  is  not  assumed 
in  the  models. 

1.2  Survey  of  Prior  Art 

Hyperspectral  sensors  are  passive  sensors  that  simultaneously  record  images  for  many 
contiguous  and  narrowly  spaced  regions  of  the  electromagnetic  spectrum.  A  data  cube  is  created 
from  these  images  in  which  each  image  corresponds  to  the  same  ground  scene  and  contains  both 
spatial  and  spectral  infonnation  about  objects  and  backgrounds  in  the  scene.  These  sensors 
employ  several  bands  and  have  been  used  in  various  fields  including  urban  planning,  mapping, 
and  military  surveillance.  Good  references  to  some  of  these  topics  in  the  context  of  remote 
sensing  and  descriptions  of  the  most  popular  sensors  since  the  mid  1960’s  are  included  in  the 

12  3 

reference  section. 

With  the  introduction  of  sensors  capable  of  high  spatial  and  spectral  resolution,  there  has  been  an 
increasing  interest  in  using  spectral  imagery  to  detect  objects  or  features  of  interest  (targets). 
However,  detection  algorithms  that  presume  a  target  signature  are  subject  to  signal  mismatch 
losses  because  of  the  complications  of  converting  sensor  data  into  material  spectra.4  Target 
matching  approaches  are  further  complicated  by  the  large  number  of  possible  targets  and 
uncertainty  as  to  the  reflectance/emission  spectra  of  these  objects.  For  example,  the  surface  of  a 
target  may  consist  of  several  materials,  and  the  spectra  may  be  affected  by  weathering  or  other 
unknown  factors.  One  may  be  interested  in  a  large  number  of  possible  objects  each  with  several 
signatures.  Thus,  the  multiplicity  of  possible  spectra  associated  with  targets  and  complications  of 

1  Schowengerdt,  1997. 

“  Coompbell,  1996. 

4  Lillesand,  1994. 

4  Schott,  1977. 
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atmospheric  compensation  have  led  to  the  development  and  application  of  anomaly  detectors 
that  seek  to  distinguish  observations  of  unusual  materials  from  typical  background  materials 
without  reference  to  target  signatures  or  target  subspaces.  Quite  often  objects  consisting  of 
unusual  materials  are  detected  as  local  anomalies  in  a  scene;  hence,  anomaly  detection  will  be 
used  interchangeably  in  this  paper  with  object  detection. 

Anomalies  are  defined  with  reference  to  a  model  of  the  background.  Background  models  are 
developed  adaptively  using  reference  data  from  either  a  local  neighborhood  of  the  test  pixel  or  a 
large  section  of  the  image.  Local  and  global  spectral  anomalies  are  defined  as  observations  that 
deviate  in  some  way  from  the  neighboring  clutter  background  and  from  the  overall  scene, 
respectively.  Both  approaches  have  their  merits.  The  local  spectral  anomaly  detector  is 
susceptible  to  false  alanns  that  are  isolated  spectral  anomalies.  An  algorithm  commonly  referred 
to  as  RX  algorithm,5  for  instance,  has  become  a  benchmark  for  multispectral  data,  based  on  this 
principle.  The  RX  algorithm  is  a  maximum  likelihood  anomaly  detection  procedure  that 
simplifies  the  clutter  to  being  spatially  white.  Researchers  have  also  adapted  some  classic 
approaches6  (e.g.,  Fisher’s  Linear  Discriminant  [FLD],  Principal  Component  Analysis  [PCA])  in 
the  same  spirit  of  the  RX  algorithm  to  anomaly  detection  in  hyperspectral  sensor  imagery. 

Global  spectral  anomaly  detection  algorithms,  on  the  other  hand,  are  not  susceptible  to  this  type 
of  clutter-generated  false  alarms.  However,  a  global  anomaly  detector  will  not  find  an  isolated 
target  in  the  open  if  the  signature  is  similar  to  that  of  previously  classified  background  material. 
Examples  of  global  detectors  are  the  global  normal  mixture  mode7  and  the  global  linear  mixture 
model.  A  comprehensive  comparative  analysis  among  different  approaches  for  HSI  anomaly 
detection  can  be  found  in. 9,10,1 1,12 

Earlier  laboratory  experiments  using  SAR  (synthetic  aperture  radar)  imagery  produced 
compelling  evidences  to  suggest  that  using  conventional  approaches  to  compare  sample  pairs, 
i.e.,  two-sample  hypothesis  tests  treating  the  two  samples  individually,  promotes  an  intolerable 
high  number  of  meaningless  detections  (false  alanns  [FA])  in  areas  characterized  by  region 
discontinuities  in  the  imagery.  At  that  time,  I  proposed  a  post-processing  method  from 
mathematical  morphology  to  account  for  those  observations  and  adapted  it  to  a  SAR  target- 
detection  system.13,14  The  approach  produced  some  surprisingly  good  results  using  real  SAR 
data. 

5  Yu,  1997 

6  Kwon,  2003. 

7  Stein,  2002 

8  Stein,  2002. 

9  Manolakis,  2002. 

10  Crist,  1999. 

1 1  Haskett,  1999. 

12 

Grossmann,  1998. 

13  Rosario,  1999. 

14  Rosario,  2000. 
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The  novelty  in  this  paper  is  the  mathematical  means  that  is  proposed  to  address  HSI  anomaly 
detection.  The  main  contribution  of  this  report  is  threefold:  (i)  a  recently  proposed  local  anomaly 
detector15  shall  be  discussed  for  the  first  time  using  extended  details;  I  shall  describe  a  suitable 
mathematical  model  ’  ’  ’  ’  that  elegantly  materializes  a  combining  idea  and  shall  study  the 
model’s  maximum  likelihood  method  and  its  asymptotic  behavior;  I  shall  design  an  effective 
local  anomaly  detector  based  on  the  model’s  asymptotic  behavior,  which  for  convenience  shall 
be  named  the  semiparametric  (SemiP)  algorithm;  (ii)  a  second  anomaly  detector  shall  be 
proposed  to  the  community,  an  approximation  to  the  semiparametric  (AsemiP)  algorithm,  which 
may  be  used  to  replace  the  complicated  equations  of  the  first  model’s  solution  with  simpler 
equations — yet  describing  the  same  phenomenon;  I  shall  state  a  proposition  of  the  second  model 
and  prove  its  statement.  Derivation  of  the  AsemiP  algorithm  is  motivated  by  the  SemiP' s  output 
properties,  not  by  the  semiparametric  model  itself — although,  its  derivation  is  also  based  on 
approximation  theorems  of  mathematical  statistics;  and  (iii)  in  order  to  promote  the  use  of 
models  whose  mathematics  are  based  on  the  statistical  assumption  of  independent,  identically 
distributed  random  samples,  an  inside/outside  window  mechanism  shall  be  introduced  aimed  at 
transfonning  local  HSI  information  into  independent  sample  pairs.  Comparative  results  are  also 
presented  between  SemiP,  AsemiP  and  other  approaches  commonly  used  with  hyperspectral  data. 


2.  Approach 


2.1  Problem  Formulation 

Figure  1  shows  a  hyperspectral  scene  that  consists  of  14  ground  vehicles  parked  across  a  grassy 
area  along  a  treeline.  For  surveillance  applications  in  similar  scenes,  human  analysts  do  well 
quickly  ignoring  most  of  the  imagery  and  concentrating  their  attentions  to  those  vehicles  and 
their  shadows.  Humans,  of  course,  use  both  global  and  local  infonnation  to  focus  their  attention 
to  meaningful  objects  in  the  scene,  a  capability  that  can  only  be  reproduced  by  applying,  for 
instance,  layers  of  unsupervised  learning  methods  complementing  each  other  to  perfonn  this 
single  task.  For  example,  a  suite  of  algorithms  that  includes  an  edge  detector,  an  edge  elongation, 
a  clustering  method,  and  a  morphological  size  test  would  probably  reproduce  the  humans’ 
performance  in  such  a  scene,  but  with  a  huge  cost:  computational  time. 


1 5  Rosario,  2004 

16  Qin,  1997. 

17  Fokianos,  2001 

1 8 

Anderson,  1972. 
19  Prentice,  1979. 

~  Cox,  1996. 
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Figure  1.  Nonhomogeneous,  multicomponent  scene  from  the  hyperspectral 
digital  imagery  collection  experiment  (F1YDICE).  (A  pixel  in  an 
HYDICE  imagery  is  represented  by  a  vector.)  Typically,  local  anomaly 
detectors  produce  an  intolerable  high  number  of  false  alarms 
(non-anomalies)  in  similar  scenes;  local  region  discontinuities  degrade 
detectors’  performances.  The  sampling  mechanism  is  discussed  in  the  text. 


My  interest  is  to  approach  humans’  performance  using  a  single  unsupervised  learning  algorithm 
that  functions  as  a  local  anomaly  detector.  Figure  1  illustrates  the  sampling  mechanism  I  propose 
to  transform  local  imagery  infonnation  into  sample  pairs  for  our  statistical  models.  It  introduces 
three  window  cells  from  which  samples  will  be  drawn  from  the  data.  These  windows  are  referred 
to  as:  test  cell,  reference  cell,  and  variability  cell.  Information  between  the  variability  and 
reference  cells  will  be  used  to  form  a  control  or  reference  feature  vector,  and  information 
between  the  variability  and  test  cells  will  be  used  to  form  an  unknown  test  feature  vector. 

The  test  cell  provides  a  spectral  sample  average  (//;)  from  a  (w  x  w)  window;  the  reference  cell 
provides  a  spectral  sample  average  (ju0)  from  M  vectors  surrounding  a  guard  region,  i.e.,  a  blind 
area  between  test  and  reference  cells  to  account  for  larger  than  (w  x  w)  targets;  and  the 
variability  cell  provides  J  individual  spectral  vectors  (v7)  each  consisting  of/c  =  spectral 
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responses  (%)  for  K  distinct  wavelengths  in  the  visible  to  SWIR  (shortwave  infrared)  region  of 
the  electromagnetic  spectrum,  i.e.,  region  from  0.4  jum  to  2.4  /jm. 


Hyperspectral  data  have  highly  correlated — hence,  dependent — spatial  and  spectral  clutter,  so  to 
promote  statistical  independence,  given  that  we  will  make  this  assumption  in  our  models,  I 
propose  to  apply  a  high-pass  (HP)  filter  in  the  spectral  domain,  thus  transforming  vj  into  A,  (see 
figure  1),  and  then  use  A,  to  compute  a  feature  that  promotes  spatial  independence.  The  feature  is 
known  as  spectral  angle  mapper  (SAM),  which  in  essence  computes  the  angle  between  two 
vectors,  or 


x,7  =—  arccos 

y  n 


A'.A 


A,  A„ 


«ll  J 


(1) 


where  A,-  =  Ty  -  Ay(k-i)  (A,-  is  a  [K-l]  x  1  vector);  Aui  =  AWtk  -  ^u/k-ij  (using  spectral  components 
from  vectors  juo  and  ///),  x,  =  [x;/] ;  i=0,l\ j=l,  ...,J  (J  is  the  total  number  of  vectors  in  the 
variability  cell);  xy  range  from  0  to  90  degrees  ( 0  representing  minimum  class  difference 
between  reference  and  test  samples  and  90  representing  the  maximum  class  difference  between 
these  samples);  the  operator  ||  z  ||  denotes  the  squared  root  of  z'z;  and  [*]'  denotes  the  vector 
transpose  operator. 

Let  xo  denote  the  reference  feature  vector,  xi  the  test  feature  vector,  and  let  both  vectors  be 
distributed  (~)  by  unknown  joint  distributions  /);  and  //,  respectively,  or 

*l  =  l>il  >  *l«j  1  ~  /i(*)  (2) 

*0  =  [ *01  >  •••  >  *o»0  ]  ~  fo  (*) 
where,  no  =  ni  =  J  in  this  particular  implementation. 

The  window  cells  are  expected  to  systematically  move  throughout  the  imagery  and  at  each 
location  this  question  will  be  posed:  Do  xo  and  x/  belong  to  the  same  population,  or  class,  in  the 
feature  space?  If  the  answer  is  no,  the  test  sample  will  be  labeled  as  an  anomaly  with  respect  to 
its  surroundings  at  that  location. 

A  conventional  two-sample  hypothesis  test  would  work  very  well  if  samples  x0  and  x;  do  belong 
to  distinct  classes  Co  and  Cj,  or  to  one  of  these  classes.  Problems  occur,  however,  when  one  of 
the  samples  (e.g.,  xo)  belongs  to  a  composite  class  consisting  of  both  classes  Co  and  Ci,  denote 
x0(C0C i),  and  then  it  is  compared  to  X/(C/).  In  those  cases,  standard  statistical  tests  may  reject  the 
hypothesis  that  x/ C/C/)  and  X/(C/)  belong  to  the  same  class.  This  rejection — correct  as  it  may 
seem — is  arguably  the  most  dominant  driving  force  affecting  the  number  of  FA  produced  by 
most — if  not  all — local  anomaly  detectors  using  sensor  imagery.  The  reason  is  that  region 
discontinuities  (e.g.,  boundaries  between  tree  clusters  and  their  shadows)  are  abundant  in  sensor 
imagery,  and  they  are  not  taken  into  account  in  conventional  statistical  models. 

21  Kelly,  April  1989. 
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2.2  Statistical-Motivated  Idea 

I  propose  an  indirect  comparison  approach  to  circumvent  the  problem  discussed  in  section  2.1. 
The  approach  is  not  based  on  a  physical  motivation  but  on  a  statistical  one.  The  key  is  not  to 
compare  samples  xo  and  xi  directly,  but  to  make  that  comparison  indirectly  by  constructing  a 
new  sample  z,  consisting  of  both  xq  and  xi,  and  then  by  comparing  z  in  some  form  to  either  xo  or 
xi.  To  clarify  this  notion,  consider  the  following  case  studies  when  comparing  xo  and  xp  ( 1 ) 
samples  belong  to  distinct  classes,  x0(Co)  andx/(C;);  (2)  samples  belong  to  a  single  class,  x0(Co) 
and  xi(Co)',  and  (3)  one  of  the  samples  holds  infonnation  from  two  classes  while  the  other  sample 
belong  to  one  of  these  classes,  xo{CoCi )  and  xi{Ci).  Now  consider  the  following,  where  U 
denotes  the  union  of  samples: 


Case  1 

Plausible  Result 

Conventional 

x0(Co)  =?x1(C1 ) 

No 

Proposed 

ziCoCj)  =  { x0(Co)  U Xl(Ci)}  =?x1(C1) 

No 

Case  2 

Plausible  Result 

Conventional 

x0(Co)  =?Xl(C0 ) 

Yes 

Proposed 

z(CoCo)  =  {xo(Co)  Ux,(Co)}  =?  x/(C0) 

Yes 

Case  3 

Plausible  Result 

Conventional 

xo(CoCj)  =?x1(C1) 

No 

Proposed 

z(C0C,C1)  =  {xo(CoCj)  U x;(C/)}  =?  Xl(C ,) 

Maybe 

It  is  plausible  that  the  proposed  comparison  approach  would  yield  the  same  results  produced  by 
conventional  methods  for  case  studies  1  and  2,  where  No  and  Yes  denote  anomalies  and  non- 
anomalies,  respectively.  In  particular,  z  in  Case  1  would  have  been  labeled  as  a  strong  anomaly 
in  respect  to  xj  for  the  same  reason  a  conventional  test  would  have  rejected  the  hypothesis  that  a 
composite  sample  (e.g.,  tree  and  shadow)  belongs  to  the  same  class  of  a  relatively  pure  sample 
(e.g.,  shadow).  In  Case  3,  however,  the  proposed  and  conventional  approaches  would  probably 
disagree  in  the  intensity  of  their  results  because  z(CoCiCi)  shows  more  evidence  of  Cj  than  does 
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xo(CoCi).  Sample  x/(C/)  would  seem  statistically  closer  to  z(CoCiCi)  than  to  xo(CoCi),  and  that 
difference  would  help  to  interpret  xi{Ci)  as  a  soft-anomaly  (labeled  in  Case  3  as  Maybe). 

My  goal  is  to  propose  statistical  models  that  are  capable  of  accentuating,  significantly,  local 
anomalies  (No)  from  soft-anomalies  (Maybe)  and  non-anomalies  (Yes).  Such  a  capability  would 
allow  a  detector  to  retain  a  high  probability  of  correct  detections  while  significantly  reducing  the 
number  of  nuisance  detections.  The  first  statistical  model  that  I  propose  combines  samples  by 
relating  in  some  form  the  probability  distribution  functions  of  xo  and  x/.  The  model  is  discussed 
next. 

2.3  Logistic  Model 

Let  vectors  Xk  have  their  components  independently,  identically  distributed  (iid).  Let  xo  be 
independent  ofx/.  And  consider  the  following: 


=  [xn,---,Xi,|  ]  iid  ~  g] (x) 

=  [x0l,---,X(K]  iid  ~  g0(x). 

(3) 

g][  \  =exp (a  +  j3h(x)), 
go(x) 

(4) 

where  gi  is  regarded  as  an  exponential  distortion  of  go  and  h(x)  is  an  arbitrary  but  known 
function  of  x.  Note  in  (3)-(4)  that  no  parametric  assumption  is  given  to  go  and  that  gi  depends 
only  on  the  unknown  parameters  a  and  /?,  hence,  justifying  the  model’s  name:  semiparametric. 

The  rationale  for  proposing  to  use  (4),  as  our  baseline,  is  that  many  common  distribution  families 
can  be  expressed  as  a  canonical  exponential  function.  These  families  fall  under  a  category  of 
probability  density  functions  called  exponential  families,  which  are  known  to  have  many  nice 
mathematical  and  statistical  properties.  (Some  of  these  properties  are  discussed,  for  instance, 
in.  )  One  of  these  mathematical  properties,  for  example,  is  that  an  exponential-family 
distribution  can  always  be  expressed  as  a  shift  of  another  exponential-family  distribution,  as 
shown  in  (4).  Reference  ~  provides  a  good  discussion  on  this  topic. 

Model  (3)-(4)  is  based  on  case-control  data  and  its  mathematical  development  depends  on  some 
of  the  advances  made  on  the  theory  of  semiparametric  inference.  Semiparametric  approaches  are 
commonly  used  in  analyzing  binary  data  that  arise  in  studying  relationships  between  disease  and 


22  Casella,  1990. 

23  Kay,  1987. 
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environment  of  genetic  characteristics.  ’  ’  Equation  4  has  its  roots  in  the  standard  logistic 
function  having  the  general  form 


P{z) 


1  +  exp(  y  +  (3z ) 


(5) 


where  y  is  a  scale  parameter  and  /?  interpreted  as  a  constant  rate  both  defining  a  proportion  P, 
which  is  dependent  on  a  variable  z.  The  logistic  function  was  invented  in  the  19th  century  for  the 
description  of  the  growth  of  populations  and  the  course  of  autocatalytic  chemical  reactions. 
Pierre-Francois  Verhulst  (1804-1849),  a  Belgian  statistician,  named5  as  the  logistic  function  and 

24 

published  his  suggestions  between  1838  and  1847.  For  an  elaborate  historical  account,  see. 

Using  the  independence  assumptions  in  model  (3)-(4),  with  h(x)=  x,  the  MLE  (maximum 
likelihood  estimate)  of  a  and  ft  can  be  attained  via  the  likelihood  function, 

"o  ”i 

C(cc,fi,g0)  =  n^o(^')nexP(«  +  ^Xi,)go(^)  1  ; 

i= 1  7=1 

/i=/ij+«0  /ij 

=  rwo  )J4  exp(ar  +  (3  xly). 

i= 1  7=1 

where,  n  =  n0+ni  and  the  combined  feature  vector  t  is 

1  ~  \-X\  1  ’ " "  ’  -fin,  ’  fill  ’ " '  ’  xOn0  ]  —  [h  ’ " "  ’  ]  •  (^) 


Notice  in  (6)  that  the  part  involving  go(t,)  (reference  distribution)  reflects  the  combined-sample 
property  of  a  model  we  sought,  and  the  part  involving  the  exponential  distortion  reflects  only  the 
property  of  the  sample  that  is  not  the  reference.  Both  properties  fit  well  into  the  proposed 
framework  of  merging  samples  and  then,  in  some  form,  comparing  this  combined  structure  with 
one  of  its  original  samples. 

Notice  also  that  go  in  (6)  is  unknown,  thus,  deriving  MLE  of  a  and  (3  via  standard  procedures 

cannot  be  attained.  However,  using  profiling  one  can  express  go  in  terms  of  a  and  (3  and  then 

replace  go  with  its  new  representation  back  in  (6).  Using  profiling,  the  method  of  Lagrange 

18. 

multiplier  was  proposed  to  attain  the  maximization  of  <^by  fixing  (a, (3)  and  then  maximizing 
^with  respect  to  go(h)  for  i=l, ...,/?,  subject  to  constraints 

Z£o(fi)  =  1>  gdh)>0,  (8) 

X  [exp(  a  +  (3tt )  -  l]g0  (tt )  =  0, 

5  Yu,  1997. 

1  8 

Anderson,  1972. 

19  Prentice,  1979. 

~  Cox,  1996. 

' >4 

Cramer,  2002. 
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where  the  last  constraint  reflects  the  fact  that  expi a+fix)go(x)  is  a  distribution  function. 
Following  this  approach,  it  can  be  shown16  that  the  maximum  value  of  <^is  attained  at 


SOV 'i  )  -  .  .  s’ 

n0  1  +  p  exp(  a  +  fitfi 

where  p=nfino,  and  that  ignoring  a  constant,  the  log-likelihood  function  is 

n\  n 

l(a,j3)=Yj(a  +  fixXi )  ■ -  £  1 log[l  +  p  exp(«  +  fit  ,•)] . 

7=1  i=l 

A  system  of  score  equations  that  maximizes  (10)  over  (a, ft)  is  shown  below, 

dl(a,fi)  _  pexp[a  +  fit t\  ^  =Q 

da  1  +  /?exp[a  +  fitt] 

cilia,  fi)  =  tjp  exp(  a  +  fi  tfi  +  y  x  _  Q 
Sfi  “)1  +  pexp(a  +  fitf)  ' 

Let  (a*, fif)  satisfy  ( 1 1)-(  12),  then  using  (9)  it  can  be  shown16  that  the  MLE  of  go(x)  is 


&  0  (h )  *  * 

1  +  pexp(a  +  fi  tj ) 


(9) 


(10) 


(11) 

(12) 


(13) 


2.4  SemiP  Algorithm 

In  this  subsection,  I  adapt  the  theory  in  section  2.3  to  the  framework  of  object  (anomaly) 
detection.  The  mathematics  for  this  adaptation  is  nontrivial,  as  we  employ  the  asymptotic 
behavior  of  ( a*,ff  )  under  a  model  that  does  not  assume  Normality.  We  strive  to  reach  a 
reasonable  balance  between  showing  the  most  important  parts  of  the  mathematical  development 
and  length  of  this  report. 

First,  notice  that  for  gfix)  =  exp  (a  +  fix)go(x)  to  be  a  density,  a  hypothesis  of  fi  =  0  in  (4)  must 
imply  a  =  0,  as  the  term  exp(a)  functions  as  a  normalizing  factor  so  that  g/  integrates  over  x  to  a 
total  mass  unity.  Second,  notice  that  the  hypothesis  Ho:  fi  =  0  (given  that  a  must  also  be  equal  to 
zero)  in  (4)  implies  that  a  test  population  and  a  reference  (control)  population  are  equally 
distributed,  namely,  gi  =  go.  With  this  logic,  one  can  design  an  anomaly  detector  from  the 
following  composite  hypothesis  test: 

H0:  fi  =  0  (g,  =g0)  anomaly  absent 
ll\  :  fi  ^  0  (g|  ^  g() )  anomaly  present. 


16  Qin,  1997. 
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Under  this  test,  local  regions  in  the  entire  imagery  would  be  individually  tested  to  reject  the  null 
hypothesis  {Ho)  yielding  in  the  process  a  binary  surface  of  values  of  1,  depicting  a  rejection  of 
Ho,  and  values  of  0,  depicting  a  non-rejection  of  Ho.  An  isolated  object  in  a  scene  would  be 
expected  to  produce  a  cluster  of  1  values  (anomalies)  in  the  resulting  binary  surface.  But  to 
design  the  hypothesis  test  in  (14),  one  must  know  the  asymptotic  behavior  of  the  extremum 
estimator  /T,  which  one  can  verify  (see  Appendix  A)  that  it  converges  to  Nonnality,  or 


f 


V 


Uo±A) 

Var(t)  ^ 


(15) 


Squaring  the  left  side  of  (15)  and  normalizing  the  result  by  the  asymptotic  variance,  one  can 
conclude  (see  convergence  results,  for  instance,  in22)  that  the  resulting  random  variable 

Z  =  np{\  +  p)~2 /?*V*(0-»  X\  (16^ 

converges  to  a  chi  square  distribution  with  1  degree  of  freedom,  where  V* (t)  estimates  Var(t). 


2.5  Approximating  SemiP  Performance:  AsemiP  Algorithm 

In  reference  to  (16),  there  are  two  major  factors  working  in  harmony  and  in  complementary 
fashion  to  promote  maximum  separation  between  signal  (anomalies)  and  noise  (non-anomalies), 
they  are:  the  squared  value  of  J3*  and  the  estimated  combined  variance,  V*(t),  which  is  also 
quadratic. 

These  factors  work  in  the  following  way:  When  two  samples  from  the  same  class  are  compared 
(i.e.,  is  H0:  (5=0  [ gi=go ]  true?),  the  term  ((f)2  tends  to  approach  zero  very  fast,  specially  for  (5* 
values  less  than  unity.  On  the  other  hand,  if  two  samples  from  distinct  classes  are  compared,  the 
term  V* (t)  tends  to  a  relatively  high  number,  also  very  fast,  asserting  the  fact  that  a  combined 
sample  vector  t  consists  of  components  belonging  to  distinct  populations. 

Motivated  by  these  properties,  I  shall  state  and  prove  an  approximation  algorithm  based  on  large 
sample  theory  that  replaces  complicated  SemiP  equations  with  simpler  ones  describing  the  same 
phenomenon. 

Proposition  1  (AsemiP  Algorithm) .  Let 

2  (17' 

*1  =[xn,---,xhll\beiid,  E{xXi)  =  px  ,Var{xXi)  =  <oo;  v 

*0  =Uoi ,---,x0nQ]beiid,  E{x0i)  =  ju0,  Var(x0i)  =  cqj  <co; 


assume  that  xo  and  xi  are  independent  and  that,  for  some  xq  and  xi,  the  combined  vector  t 


t  | - V[  | .V] - -V-0 ]  \t\  v4/7=7?|  +«o  ]  is  dd,  Var{t)  cr^<co, 


(18) 


22  Casella,  1990. 
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and  define 


a2 

cr2 

(19) 

P=P\  ~P0’  =  o’  Pi  = 

°0 

Under  some  regularity  conditions,  if  hypothesis  Hq:  (b  =  0;  2,  =  C, 

=  l  )  is  true  and 

P  =  x i-x0,  xi  =  ni  5>;Vt, 

i  =  0,1; 

(20) 

k= 1 

n 

n 

t  = 

7=1 

(21) 

V it)  =  X  (6  - 1 ) 2  •  g(x] ,  x0 ),  i i  =  ni+n0 

7=1 

,  where 

(22) 

g(x\,x0) 


(n- if 

n\  9  "0  7 
XOfi;  -*i)  +Z(x0;-x0) 


;  and 


P 


'1  P-1 

- 1 - 

v/ji  »oj 


|_z=l  i=l  J 

where,  fi  is  an  estimate  of  ft ,  then  the  random  variable 

X  =  p{n-\yli2V{t)^  Xi  W 

converges  in  distribution  to  a  chi-squared  distribution  with  1  degree  of  freedom. 


I  shall  make  some  insightful  comments  on  (23)  and  refer  the  reader  to  its  proof  in  Appendix  B. 

By  inspection,  one  should  readily  recognize  the  behavior  of  the  chosen  function  p  ,  in 

Proposition  1,  as  it  approximates  the  behavior  of  (5.  If  two  samples  from  the  same  population  are 
compared  using  (23),  the  estimate  of  p  ,  p  in  (20),  would  also  tend  to  approach  zero — as  the 
sample  size  increases,  and  tend  otherwise  for  samples  belonging  to  distinct  populations. 


The  real  challenge,  however,  is  to  derive  a  relatively  simple  estimate  of  Var(t),  as  defined  in 
(7A),  to  replace  V*(t),  as  shown  in  (16).  Var(t)  is  a  sum  of  squared  errors  individually  weighted 
by  their  probability  of  occurrence.  In  Proposition  1,  g(xi,xo)  is  proposed  to  provide  that 
probability  feature,  but  as  an  average  probability  of  occurrence,  instead.  In  this  sense,  comparing 
two  samples  from  distinct  populations  would  produce  very  high  cumulative  square  errors  using 
the  combined  vector  t,  but  appropriately  weighted  by  an  average  proportion.  The  proof  of 
Proposition  1  is  presented  in  Appendix  B. 

In  principle,  the  overall  behavior  of  (23)  seems  to  track  that  of  (16),  and  both  random  variables 
are  asymptotically  identically  distributed  under  X\  •  Note  that  the  A  semi  P's  performance  will  not 

asymptotically  approach  that  of  the  SemiP’s  performance,  as  the  number  of  samples  increases; 
the  former  approximates  the  general  behavior  of  the  latter,  i.e.,  it  promotes  a  high  separation 
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between  meaningful  signals  (isolated  objects)  from  noise  (homogeneous  and  non-homogenous 
local  regions). 


3.  Alternative  Techniques 


In  this  section,  I  make  a  few  general  comments  on  four  other  anomaly  detection  techniques, 
which  shall  be  used  in  this  paper  for  comparison  purposes:  their  mathematical  representations 
are  presented  in  this  section,  without  proofs,  and  a  reference  that  fully  describes  their 
implementation  issues  and  performances  in  the  HYDICE  dataset  will  be  cited. 

The  four  techniques  are  known  as:  RX  (reed-xiaoli),  PCA  (principal  component  analysis),  EST 
(eigen  separation  transfonn),  and  FLD  (Fisher’s  linear  discriminant).  These  techniques — or 
variations  of  them — are  commonly  used  in  the  hyperspectral  community.  The  RX  technique  is 
based  on  the  generalized  likelihood  ratio  test  and  on  the  assumption  that  the  population 
distribution  family  of  both  test  and  reference  samples  are  multivariate  normal.  The  FLD 
technique  is  also  based  on  the  same  assumption,  but  differs  in  its  subtleties  in  answering  the 
question  whether  the  test  and  reference  samples  are  drawn  from  the  same  normal  distribution. 

The  FLD  technique  promotes  separation  between  classes  and  variance  reduction  within  each 
class.  The  PCA  and  EST  techniques  are  both  based  on  the  same  general  principle,  i.e.,  data  are 
projected  from  their  original  high  dimensional  space  onto  a  significantly  lower  dimensional 
space  using  a  criterion  that  promotes  highest  sample  variability  within  each  domain  in  this  lower 
dimensional  space.  Differences  between  PCA  and  EST  are  better  appreciated  through  their 
mathematical  representations. 

These  four  techniques  were  implemented  with  the  conventional  inside-outside  window  approach, 
where  optimum  window  sizes  were  chosen  to  account  for  the  target-size  range  of  interest  (see 
section  4.1),  i.e.,  5x5  inside  window  embraced  by  a  9x9  outside  window  and  leaving  a  guard  area 
between  pixels  6  and  7,  inclusive,  from  the  center  of  the  inside  window.  These  techniques  are 
represented  by  the  following  set  of  equations: 


Sc°re  rx  -  (xin  xout )  Cout  (xin  xout) 

(24) 

Score  PCA  =  E\n  (xin  -  xout ) 

(25) 

Score  EST  =  EtAC(xin  -  xout  ) 

(26) 

ScoreFLD  =  ElSh ,  ^  (xin  -  xout ) 

(27) 

where  xin  is  a  sample  mean  vector  from  the  set  of  inside- window  vectors,  each  having  150 
spectral  bands  (see  section  7.1);  xout  is  similar  but  sampled  from  the  outside  window;  C~tla  is 
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the  inverse  sample  covariance  using  all  vectors  sampled  from  the  outside  window;  E/n  is  the 
highest  energy  eigenvector  of  the  eigenvector  decomposition  of  the  inside-window  covariance; 
E^c  is  the  highest  positive  energy  eigenvector  of  the  eigenvector  decomposition  of  the 

covariance  difference  (inside-widow  minus  outside- widow);  and  ErSjSw  is  the  eigenvector 
decomposition  of  the  scatter  matrices  ratio  SBSH)  ,  where 


N  _  _  M  _ 

Sw  =1(4* -Xin){x^-Xj  +£(*«  -xouMl, -XoJ 


i= 1 


N 


i=l 


M 


SB  -XtotaMrl  -XtoJ  +B4l-xtoJ(Ct-xtoJ 


(28) 

(29) 


i=l 


i= 1 


and  Xtotal  is  the  total  sample  average  using  all  samples  from  the  inside  and  outside  windows. 
For  additional  details  on  the  implementation  and  performance  of  these  techniques,  see.6 


4.  Results 


4,1  Data  Description 

An  experiment  was  carried  out  on  the  data  set  from  the  hyperspectral  digital  imagery  collection 
experiment  (HYDICE)  sensor.  The  HYDICE  sensor  records  210  spectral  bands  in  the  visible  to 
near  infrared  (VNIR)  and  short-wave  infrared  (SWIR)  and,  for  the  purpose  of  this  paper,  it  is 
sufficient  to  know  that  each  pixel  in  a  scene  is  actually  a  vector,  consisting  of  210  components. 

25 

An  extended  description  of  this  dataset  can  be  found,  for  example,  in. 

The  results  shown  in  this  section  for  one  data  sub-cube  are  representative  for  other  sub-cubes  in 
the  HYDICE  (forest  radiance)  dataset.  An  illustrative  sub-cube  (shown  as  an  average  of  150 
bands;  640  x  100  pixels)  is  shown  in  figure  1  (left).  (I  only  used  150  bands  by  discarding  water 
absorption  and  low  signal  to  noise  ratio  bands;  the  bands  used  are  the  23rd- 10 1st,  109th- 136th,  and 
152nd- 194th.)  The  scene  consists  of  14  stationary  motor  vehicles  (targets  near  a  treeline)  in  the 
presence  of  natural  background  clutter  (e.g.,  trees,  dirt  roads,  grasses).  Each  target  consists  of 
about  7x4  pixels,  and  each  pixel  corresponds  to  an  area  of  about  1.3  x  1.3  square  meters  at  the 
given  altitude.  The  main  goals  of  local  anomaly  detection  algorithms  on  these  types  of  scenes  are 
to  detect  objects  that  seem  clearly  anomalous  to  its  immediate  surroundings,  in  some 
predetermined  feature  space,  and  to  yield  in  the  process  a  tolerable  number  of  nuisance 
detections.  Targets  are  often  found  to  be  anomalous  to  its  immediate  surroundings. 

6  Kwon,  2003. 

“5  Schweizer,  2000. 
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4.2  Implementation  Notes 

I  now  present  some  helpful  hints  on  the  implementation  of  the  SemiP  and  AsemiP  algorithms. 

1.  SemiP  Anomaly  Detector 

•  Sampling  Mechanism:  Use  the  mechanism  described  in  section  2. 1  to  sample  a  pair  of 

random  feature  vectors  xy  (i  =  0  [reference],  1  [test];  j=l, _ ,J)  from  HSI.  I  used  a  9-pixel 

(3x3)  test  window,  a  56-pixel  reference  window,  and  a  60-pixel  variability  window,  as 
shown  in  figure  1 .  Note  that  the  size  of  the  variability  window  determines  the  size  of  the 
feature  vectors,  that  is,  xq,  and  xy  have  the  same  size,  J  =  60. 

•  Statistical  Independence:  An  attempt  should  be  made  to  promote  statistical  independence  in 
HSI.  See  discussion  in  section  2.1. 

•  Function  Maximization:  Perform  an  unconstrained  maximization  of  l(a,[3)  in  (10),  or 
minimization  of  [-l(a,[3)],  to  obtain  the  extremum  estimates  (a*,|3*).  For  this  report,  I 
used  one  of  the  standard  unconstrained  minimization  routines  available  in  MATLAB™ 
software  (i.e.,  fminsearch),  and  set  the  initial  values  of  (a,P)  to  (0,0). 

Variance  Under  the  Null  Hypothesis:  V*(t)  in  (16)  should  be  computed  using  (13)  and  a  discrete 
version  estimate  of  (7A): 

E{tk) = Mi);  v\t) =E(tk=2) -E2(tk=1y  (3 1;> 

•  Decision  Threshold:  Using  (16),  high  values  of  %  reject  hypothesis  H0,  hence,  detecting 
anomalies.  Set  a  decision  threshold  based  on  the  Type  I  error,  i.e.,  based  on  the  probability 
of  rejecting  H0  given  that  H0  is  true.  Using  a  standard  integral  table  for  the  chi  square 
distribution,  with  1  degree  of  freedom,  find  a  threshold  that  yields  an  acceptable 
probability  of  error  (e.g.,  0.001),  or  alternatively  find  and  use  a  suitable  threshold  that 
yields  a  value  at  the  knee  of  the  SemiP ’s  corresponding  ROC  curve. 

2.  AsemiP  Anomaly  Detector 

In  contrast  to  the  SemiP  algorithm,  the  AsemiP  algorithm  is  significantly  simpler  to  implement, 
as  the  latter  suite  does  not  require  specialized  subroutines  (unconstrained  minimization)  to 
perform  its  function.  Using  the  sampling  mechanism  described  in  section  2.1,  the  variables  in 
Proposition  1  are  straightforward  to  implement.  One  is  also  expected  to  promote  statistical 
independence  and  to  take  a  sufficiently  large  number  of  samples  (larger  than  30)  to  justify  the 
use  of  approximation  theorems  of  mathematical  statistics.  We  used  the  sampling  mechanism 
proposed  in  section  2. 1  to  obtain  a  pair  of  random  feature  vectors  Xy  (i  =  0  [ reference ],  1  \test\, 
j=l,  ...,J).  I  used  a  9-pixel  (3x3)  test  window,  a  56-pixel  reference  window,  and  a  60-pixel 
variability  window,  as  shown  in  figure  1,  where  J  =  60.  For  a  statistical  decision,  high  values 
obtained  by  using  (23),  or  equivalently  (14B),  reject  the  hypothesis  H0  in  Proposition  1,  thus, 
detecting  anomalies.  Set  a  decision  threshold  based  on  a  choice  of  type  I  error  using,  as  the  base 
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distribution,  the  chi-square  distribution  with  1  degree  of  freedom.  Alternatively,  find  and  use  a 
suitable  threshold  that  yields  a  value  at  the  knee  of  the  AsemiP ’s  corresponding  ROC  curve. 

4.3  Comparative  Results 

ROC  curves  will  be  used  in  this  subsection  as  a  means  to  quantitatively  compare  the  six 
approaches  described  in  this  paper. 

As  described  in  section  4.1,  the  set  of  14  ground  vehicles  near  the  treeline  in  figure  1  constitutes 
our  target  set.  However,  since  anomaly  detectors  are  not  designed  to  detect  a  particular  target 
set,  the  meaning  of  false  alarms  is  not  absolutely  clear  in  this  context.  For  instance,  a  genuine 
local  anomaly  not  belonging  to  the  target  set  would  be  incorrectly  labeled  as  a  false  alarm. 
Nevertheless,  it  does  add  some  value  to  our  analysis  to  compare  detections  of  targets  versus  non¬ 
targets  among  the  different  algorithms. 

Figure  2  shows  ROC  curves  produced  by  the  output  of  the  six  algorithms  on  the  HYDICE  data 
scene  shown  in  figure  1 .  Detection  perfonnance  was  measured  using  the  ground  truth 
information  for  the  HYDICE  imagery.  We  used  the  coordinates  of  all  the  rectangular  target 
regions  and  their  shadows  to  represent  the  ground  truth  target  set;  call  it  TargetTruth.  If  we 
denote  the  region  outside  the  TargetTruth  as  ClutterTruth,  then  the  intersection  between 
TargetTruth  and  ClutterTruth  is  zero  and  the  entire  scene  is  the  union  of  TargetTruth  and 
ClutterTruth.  In  this  paper,  for  a  given  decision  threshold,  the  proportion  of  target  detection  (PD) 
is  measured  as  the  proportion  between  the  number  of  detected  pixels  belonging  to  TargetTruth 
over  all  pixels  belonging  to  TargetTruth.  On  the  other  hand,  the  proportion  of  false  alarms  (PFA) 
is  measured  as  the  proportion  between  the  number  of  detected  pixels  belonging  to  ClutterTruth 
over  all  pixels  belonging  to  ClutterTruth. 


15 


Figure  2.  ROC  curves  for  the  HYDICE  data  scene  shown  in  figure  1.  The  SemiP  and  AsemiP  detectors  are 
noticeably  less  sensitive  to  different  decision  thresholds;  their  performances  almost  achieve  an 
ideal  ROC  curve  for  that  scene,  i.e.,  a  step  function  starting  at  point  (PFA=0,PD=1). 

In  general,  the  quality  of  a  detector  can  be  readily  assessed  by  noticing  a  key  feature  in  the  shape 
of  its  ROC  curve;  The  closer  the  knee  of  a  ROC  curve  is  to  the  PD  axis,  the  less  sensitive  the 
approach  is  to  different  decision  thresholds.  In  other  words,  PFA  does  not  change  significantly  as 
PD  increases.  (An  ideal  ROC  curve  resembles  a  step  function  that  starts  at  point 
[PFA=0,PD=1].)  As  it  can  be  readily  assessed  from  figure  2,  the  SemiP  and  AsemiP  detectors 
clearly  outperform  the  other  four  techniques  on  the  tested  scene.  That  difference  in  performance 
can  be  better  appreciated  in  figure  2  (right),  where  PFA  is  further  restricted  to  a  maximum  value 
of  0.02  compared  to  0.4  in  figure  2  (left).  Beyond  the  value  of  PFA  =  0.4,  these  ROC  curves 
reach  PD  near  1.0. 

The  significant  display  of  performance  shown  in  figure  2  by  the  proposed  algorithms  can  be 
further  appreciated  by  taking  a  closer  look  at  the  output  surfaces  produced  by  all  six  detectors,  as 
they  show  evidences  of  local  candidate  anomalies  in  the  scene.  The  intensity  of  local  peaks 
shown  in  figure  3  reflects  the  strength  of  the  evidence.  It  is  evident  from  figure  3  that  the 
surfaces  produced  by  RX,  PCA,  EST,  and  FLD  detectors  are  expected  to  be  significantly  more 
sensitive  to  changing  decision  thresholds  then  the  ones  produced  by  the  SemiP  and  AsemiP 
detectors. 
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Although  the  2-dim  (2-dimensional)  version  of  the  output  surfaces  shown  in  figure  3  displays 
useful  differences  among  the  responses  of  the  six  detectors,  they  do  not  make  full  justice  to  the 
quality  of  the  SemiP  and  AsemiP  detectors,  as  a  3-dim  perspective  of  their  surfaces  would.  A  3- 
dim  perspective  of  the  SemiP ’s  and  AsemiP ’s  output  surfaces  are  depicted  in  figure  4. 


Figure  3.  Decision  surfaces  for  a  HYDICE  data  scene  (far  left).  The  intensity  of  local  peaks  reflects  the  strength  of 
evidences  as  seen  by  different  anomaly  detectors.  Boundary  issues  were  ignored  in  this  test;  surfaces 
were  magnified  to  about  the  size  of  the  original  image  only  for  the  purpose  of  visual  comparison. 

Both  surfaces  in  figure  4  were  clipped  at  the  value  of  3,000,  but  some  of  their  values  do  continue 
to  significantly  higher  numbers.  The  dominant  (clipped)  peaks  are  the  results  produced  by  the 
fourteen  targets  near  the  treeline. 

Areas  containing  the  presence  of  clutter  mixtures  (e.g.,  edge  of  terrain,  edge  of  tree  clusters), 
where  other  methods  usually  yield  a  high  number  of  false  alarms  (false  anomalies),  are 
suppressed  by  the  new  approaches.  The  reason  for  this  suppression  is  that,  as  part  of  the  overall 
comparison  strategy,  the  reference  and  test  feature  vectors  are  not  compared  as  two  individual 
samples.  Instead,  a  new  sample  is  constructed — the  union  of  both  cells — and  compared  to  the 
test  sample.  This  indirect  comparison  approach,  which  is  inherent  in  both  detectors,  ensures  that 
a  component  of  a  mixture  (e.g.,  shadow)  shall  to  be  detected  as  a  local  anomaly  when  it  is  tested 
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Figure  4.  Decision  surfaces  ( 3-dim  version)  produced  by  the  SemiP  and  AsemiP  detectors,  their 
2-dim  versions  are  shown  in  figure  3.  The  dominant  peaks  represent  the  presence  of  the 
14  targets  parked  near  the  treeline.  Less-dominant  peaks  represent  areas  in  the  scene 
most  prone  to  cause  nuisance  detections  (e.g.,  region  discontinuity).  The  SemiP  and 
AsemiP  detectors  show  a  remarkable  ability  to  accentuate  genuine  local  anomalies  from 
their  surroundings. 

against  the  mixture  itself  (e.g.,  trees  and  shadows).  Performances  of  such  cases  are  represented  in 
figure  4  in  the  fonn  of  softer  anomalies  (significantly  less-dominant  peaks). 

It  is  evident  from  figure  3  and  figure  4  that  both  detectors  perform  remarkably  well  accentuating 
the  presence  of  dominant  local  anomalies  (e.g.,  targets  and  their  shadows)  from  softer  anomalies 
(e.g.,  region  discontinuity).  This  ability  explains  the  SemiP ’s  and  AsemiP ’s  superior  ROC-curve 
performances  shown  in  figure  2. 

Processing  Time:  For  completeness,  we  report  the  processing  time  in  minutes  (min)  for  a  cube  of 
size  640  x  100  (pixels)  x  150  (bands)  using  a  personal  computer  (CPU  speed:  1.80  GHz;  RAM 
memory:  1.0  Gbytes),  MATLAB™  software  (release  13),  and  three  detectors  (RX,  AsemiP,  and 
SemiP).  The  recorded  times  were:  20.6  min  (RX),  22.1  min  {AsemiP),  42.9  min  {SemiP). 
Computing  the  local  variance-covariance  matrix  and  its  inverse  dominated  the  RX  processing 
time.  Applying  locally  a  HP  filter  in  the  spectral  domain  and  applying  SAM  on  the  resulting 
vectors  dominated  the  AsemiP  processing  time.  And,  finally,  applying  locally  a  HP  filter  and  a 
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spatial  SAM,  and  using  an  unconstrained  minimization  routine  (see  section  4.3)  dominated  the 
SemiP  processing  time. 


5.  Conclusion 


I  have  presented  an  indirect  approach  to  test  two-sample  hypotheses  in  HSI.  The  approach  is  not 
based  on  a  physical  motivation  but  on  a  statistical  one.  Using  this  approach,  I  designed  two  fully 
unsupervised  object  detectors  ( SemiP  and  AsemiP)  for  HSI.  I  showed  performance  agreement 
between  these  detectors  through  ROC  curves,  and  other  means.  Performances  of  both  detectors 
in  the  visible  to  short-wave  infrared  region  of  the  spectrum  were  compared  to  perfonnances  of 
four  other  techniques.  The  proposed  algorithms  clearly  outperfonn  the  others.  The  asymptotic 
distribution  of  the  test  statistic  under  Ha  in  both  proposed  algorithms  is  independent  of  the 
unknown  parameters,  which  implies  that  both  SemiP  and  AsemiP  have  the  constant  probability- 
of-error  property.  With  this  property,  one  can — in  theory — select  a  decision  threshold  that  yields 
a  virtual  zero  probability  of  error.  Error  in  this  context  means  detection  of  non-anomalies,  which 
is  purely  based  on  class  similarities  between  test  samples  and  their  immediate  surroundings,  not 
necessarily  non-targets  (targets  defined  as  particular  objects  of  interest).  This  distinction  should 
be  noted. 
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Appendix  A 


Proof  of  the  SemiP  algorithm:  Lemma  1 A  is  relevant  to  estimators  based  on  function 
maximization  with  respect  to  unknown  parameters. 

Lemma  1A.26  Assumptions: 

(i)  Let  0  he  an  open  subset  of  the  Euclidean  K-space.  (Thus  the  true  value  0O  is  an  interior  point 
of  0) 

(11)  Qr(y,  0)  is  a  measurable  function  of  vector  y  for  all  0  i0  and  d(QT  /  GO  exists  and  is 

continuous  in  an  open  neighborhood  N i(0o)  of  0o.  (Note  that  this  implies  Q  fy,  0)  is  continuous 
ford  €  A;,  where  T  is  the  sample  size.) 

(iii)  There  exists  an  open  neighborhood  N2(0O)  of  0osuch  that  T~'  Qf  6)  converges  to  a 
nonstochastic  function  Q(6)in  probability  uniformly  in  0  in  N2(0O),  and  Q(0)  attains  a  strict 
local  maximum  at  0o. 

Let  0i  be  set  of  roots  of  the  equation 


corresponding  to  the  local  maxima.  If  that  set  is  empty,  set  0requals  to  {0}.  Then,  for  any  s  >0, 

(2A) 

lim  P[  inf  (0  -  0O )  {0  -  0O )  >  s]  =  0. 

T  — >  oo  OeQ'f 

26 

In  essence,  Lemma  1 A  affirms  that  there  is  a  consistent  root  of  (1  A).  (For  the  proof,  see"  ). 

Under  certain  conditions,  a  consistent  root  of  (1A)  is  asymptotically  Normal.  The  affirmation  is 
shown  in  Theorem  1A,  where  asymptotic  convergence  is  denoted  by  A  -*  B  ■ 

r_  26 

Theorem  1A.  Assumptions: 

(i)  All  the  assumptions  of  Lemma  1. 

d2QT 

(ii)  - —  exists  and  is  continuous  in  an  open,  convex  neighborhood  of  0O. 

GOdO 

82Qt  _i  G2Qt 

(iii)  - —  converges  to  a  finite  nonsingular  matrix  S  (6{])  =  lim  E[T  ( - —)Q  ]  in  probability 

G6d6  G6d6  0 

for  any  sequence  0T  such  that  0T  =  0{]. 
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(iv) 


Vr(^)0  ->  N[O,V{0o)\,  where 
dO  u 


V(0O  )  =  UmE[T-'(d^)0ox(^)0o]. 


(3A) 

(4A) 


Let  {  0T }  be  a  sequence  obtained  by  choosing  one  element  from  0 t  defined  in  Lemma  1  such 
that  Oj-  — ^  6q. 


Then 


ff(eT  -  6>0 )  ->  N( 0, 1),  where 


L  =  S(Oo)-1V(Oo)S(0or.- 


-l 


(5A) 

(6A) 


For  the  proof,  see  also.76 

The  semiparametric  model’s  MLE  solution  satisfies  the  assumptions  of  Lemma  1A,  including  of 
course  (1A)  via  (11)  and  (12).  Therefore,  by  Lemma  1,  (a* ,ff)  is  consistent  and,  as  we  shall  see 
by  Theorem  1A,  it  converges  asymptotically  to  a  Nonnal  distribution. 

Under  H0:  ()=0  (gi=go),  I  shall  use  the  following  notation  for  the  moments  of  t  (the  combined 
sample  vector)  with  respect  to  the  reference  distribution  go'. 

E(tk  )  =  j  tk  g0{t)dt ,  (?A) 

Var  (t)  =  E  (t2)  -  E2  (t) 


Let  ( otofio)  be  the  true  value  of  (a,fi)  under  model  (3)-(4)  and  assume  p  =  nfino  remains  constant 
as  both  n]  and  no  go  to  infinity.  Define  V  =  and  notice  from  (11)  and  (12)  that 

E[V/  (a0,fi0 )]  =  0  .  Under  the  null  hypothesis  {H0:  J3=  0  [g/  =  go]),  using  (4),  (9),  (11),  (12) 
and  (7A)  one  can  verify  that 


1  d2l(a0,fi q)  ^ 
n  dad/3 


texp(a0+fl0t) 

l+pexp(a0+fi0t) 


go  (0  dt 


=  Ki\  *  go (0[exp(a0  +  fi0 t)g0 (t)]  dt 


P 

1 +  P 


E(t ), 


(8A) 


26 


Amemiya,  1985. 


24 


where  Kj  and  AT  are  constants  involving  ( rii,no )  and  p/(l+  p)  =  nfn  (where  n  =  m+no).  Using 
similar  argument  to  arrive  at  (8  A)  and  applying  the  weak  law  of  large  numbers  (WLLN)  (see,  for 
instance,  ),  one  can  use  assumption  (Hi)  in  Theorem  1 A  to  recognize  that 


~-VVl(a0,j30) 

n 


>S  = 


p  f  1  E( if 

i +p{m  E(f) ) 


(9A) 


in  probability  as  n  — >  oo .  It  follows  that  S  is  nonsingular  and  its  inverse  is 


S'1 


1 

E{t2)-E2{t ) 


rE(t2) 

~E(t) 


~E(t)' 

1  J 


1 +  P 
P 


(10A) 


Our  interest  is  only  in  the  parameter  /?,  so,  let  Sp  denote  the  lower-right  component  of  the 
expanded  version  of  S  and  use  (7A)  to  obtain 

1  \  +  P  _  1  1 +  P  ^11A^ 

P  E(t2)-E2(t )  P  Var(t)  p 


Using  also  the  application  of  the  central  limiting  theorem  (CLT)  in  Theorem  1 A  (iv)  and  the  fact 
that 

E[Vl(a0,f  0)]=0,  (12A) 


from  (11)  and  (12),  one  can  write 

■Jn  [V/(a0,/?o)]  ->  N\Q,V(giq, Pf)\, 


where 


V{ccq,Pq) 


p 

i +p 


1 

E(t) 


E(t ) 
E(t 2) 


P 


1 

E(t) 


[1 


E(t). 


(13A) 


(14A) 


V(ao,po)  is  a  direct  result  from  (4A),  see,  for  instance16.  Using  the  conclusion  of  Theorem  1A,  or 
(5A)-(6A),  in  terms  of  Sp  in  ( 1 1  A)  and  the  lower-right  component  of  the  expanded  version  of 
V(ao,(3o)  in  (14A),  one  can  verify  that 


■  N 


0, 


pTi  +  A) 

Var(t )  ^ 


(15A) 


16  Qin,  1997. 

2 1  Lehmann,  1991. 
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and  having  the  left  side  of  ( 1 5 A)  nonnalized  by  the  asymptotic  variance  and  then  squared,  one 

22 

can  conclude  (see  convergence  results,  for  instance,  in“  )  that  the  resulting  random  variable 

x  =  np(\  +  p)~2  p*2v*  (t)  — »  X\  <16A> 

converges  to  a  chi  square  distribution  with  1  degree  of  freedom,  where  V* (t)  estimates  Var(t).  A 
multivariate  solution  is  presented  in17. 


17  Fokianos,  2001. 
22  Casella,  1990. 
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Appendix  B 


Proof  of  Proposition  1:  If  hypothesis  IIq:  [f  =  0;  f f ,  =  l)  is  true  in  Proposition  1,  then 
cr2  =  cr2  =  cr2  and,  using  the  independent  assumptions  of  xj  and  xo,  and  CLT,  it  follows  that 


4  + 


■N( 0,1), 


In  0  n\  ^  ^|n0  n\  1  n\—> 


P  as  defined  in  Proposition  1 ;  in  addition,  the  following  estimators  of  cr 2  and  a,2  are  known  to 


26 

be  consistent  [see,  for  example,  ]: 


"0  _ 
M-%  -Xq) 


S?  =  - - - ,  and  S2=^ - - . 

/7j  - 1  «0  _  t 

Using  both  samples  x/  and  xo,  let  the  following  be  another  estimator  of  cr2  (or  cr,2 ),  given  that 
under  1 1 o  cr 2  =  cr 2 , 

52  =  (ii1-1)512+(«0-1)502  ( 

(«l  “  1)  +  («o  “ !) 

2  2  2  2 

The  estimator  S'-  is  unbiased  under  Ho,  as  its  expected  value  E[S  ]  is  equal  to  erf  and  erf  : 

£(c2)  (»i-l)^l2)+(»0-l)^o)  6 

(Wi  —  1)  +  (»o  —  1) 

=  (»1  -l)Q'l2  +(»Q  -1)«7q 
(»i  —  l)  +  (»o  — !) 


9  9  9  9 

True  because  S0  and  Sf  are  consistent  estimators  and,  under  H0 ,  erf  =  erf .  I  want  to  prove  now 
a  WLLN  for  S'  to  verify  that  S~  is  also  a  consistent  estimator.  Using  Chebychev ’s  inequality" 
under  Ho. 

■>.  .  E(S2-al)2  _Var(S2)  (5B) 


P{\S2-c7i\>8)< 


and,  thus,  a  sufficient  condition  that  S2  converges  in  probability  to  cr2 ,  or  cr,2 ,  is  that 

Ms2)  >o- 
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Amemiya,  1985. 
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Serfling,  1980. 
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2 

Note  that  Var(S')  can  be  expressed  as 

and,  as  S2  and  S2  are  both  consistent  estimators,  their  variances  must  converge  to  zero, 


Vat{Sf)- 


n\—> 00 


->0;  Var{S^)- 


fl0— » CO 


->0, 


(6B) 


(7B) 


also 


Then  Var(S~)- 


1  - 

.(«i-l)+(n0-l) . 


(n„n0)-x> o 


L(«i-l)+(n0-l)  J  (/*! ,/i0 )— >oo 

->0  and,  therefore,  under  lid. 


>  0;  k  =  0,1 . 


(8B) 


^_2  2 

^  0  6X  1  i  /  \ 

^r  =  ^r->i.  as  oo. 


(9B) 


Using  the  same  argument  to  arrive  at  (9B),  we  can  also  show  that  under  Ho. 

%  =  =Cl  =  b  as  77  =  («,+«„)->•  00, 

<r0  cr, 

where  5, 2  is  a  consistent  estimator  of  cr2  under  Ho,  or 


S2=(77-ir1X(f-02,  i  =  n~'Y,k- 

i= 1  1=1 


(10B) 


(1  IB) 


9  . 

Furthermore,  since  5)  is  the  sample  variance  of  t  (the  combined  vector)  under  Ho,  one  can 
readily  verify  that 


S  S 

—  =  —  —>1,  as  n  -  (i\  +«0)  — >co 
ao  ai 


(12B) 


26 

(see,  for  example,  Chapter  5). 

To  finalize  the  proof,  consider  Theorem  IB  below. 


26 


Amemiya,  1985. 
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Theorem  IB  (Slutsky).  Let  Xn  tend  to  X  in  distribution  and  Yn  tend  to  c  in  probability,  where  c 
is  a  finite  constant.  Then 

(i)  Xn  +  Yn  tend  to  X+c  in  distribution; 

(ii)  Xn  Yn  tend  to  cX  in  distribution; 

(Hi)  Xn/Yn  tend  to  X/c  in  distribution,  if  c  is  not  zero. 

28 

See  proof  in. 

Using  (IB),  (9B),  (12B)  and  the  Slutsky  Theorem,  one  can  conclude  that 


P 


-4  +  —cr0 
V  v  "°  "l  J 


(  -  2^ 


KaOj 


flQ  — ^  GO 
fl\  — >00 


^(0,1) 


(13B) 


22 

and  that  by  squaring  (13B)  and  using  convergence  results  from,"  one  can  also  conclude  that 


8 1  s: 


(14B) 


H  0  — ^  GO 

n\  — >  oo 


->  X  l 


which  can  be  readily  refonnatted  into  (23)  using  the  definitions  given  in  Proposition  1  and  in  this 
proof.  Equation  (14B)  concludes  the  proof. 


22  Casella,  1990. 
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Serfling,  1980. 
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